Abstract

The analysis of time course data is non-trivial as in involves dependencies between data points. Our package helps researchers analyze RNA-seq datasets which include gene expression measurements taken over time. The methods are specifically designed for at datasets with a small number of available replicates and time points is small, which is a typical case for RNA-seq time course studies. Short time courses are more difficult to analyze, as many statistical methods designed for time series data might require a minimum number of time points, e.g. functional data analysis (FDA) and goodness of fit methods might be ineffective. Our approach are non-parametric and distance-based which gives the user a flexibility to incorporate different normalization techniques, and distance metrics. vistimeseq is a comprehensive time course analysis toolbox with methods for data visualization, clustering and differential expression analysis. Additionally, the package can perform enrichment analysis for DE genes.

Introduction

We will demonstrate the effectiveness of vistimeseq package using the in-house dataset exploring the role of Cop1 in pro-inflammatory response. The experiments were designed to study the induction and repression kinetics of genes in bone marrow derived macrophages (BMDMs).

The dataset includes cells from 6 mice. The cells were divided into two equal groups. For one group the Cop1 gene was in vitro knock out with tamoxifen. All samples where then subject to LPS treatment to induce an inflamatory response. Bulk RNA-seq was performed at 6 time-points: one at time 0 before LPS treatment, then at time 2.5, 4, 6, 9 and 13 hours after LPS was added.

Obtain and process data

First we load the necessary packages.

.packages <- c("dplyr", "edgeR", "ggplot2", "tibble", "tidyr", 
               "viridis", "vistimeseq", "Biobase") 
sapply(.packages, require, character.only = TRUE)
##      dplyr      edgeR    ggplot2     tibble      tidyr    viridis vistimeseq    Biobase 
##       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE
theme_set(theme_bw())
theme_update(text = element_text(size = 15))

The dataset from the study is shipped together with the package as an ExressionSet object. We can read the data as follows:

# Load gene data
# WILL CHANGE TO GEO IMPORT
cop1_eset <- readRDS("../../external/NGS1471_esetCounts.rds")

We can easily convert ExpressionSets to vistimeseq using vistimeseqFromExpressionSet() funtion, indicating columns with relevant information. In vistimeseq the time variablie must be numeric, so we first convert need to it.

cop1_vistimeseq <- vistimeseqFromExpressionSet(
  cop1_eset, time = "time", group = "genotype", replicate = "individual")
## Error in vistimeseqFromExpressionSet(cop1_eset, time = "time", group = "genotype", : time column must be numeric.
cop1_vistimeseq
## Error in eval(expr, envir, enclos): object 'cop1_vistimeseq' not found
head(phenoData(cop1_eset)$time)
## [1] "T0"  "T6"  "T13" "T4"  "T4"  "T4"
# convert timpoint data T0, T2.5, T4, ..., T13 to numeric values
pData(cop1_eset)$time <- gsub("T", "", pData(cop1_eset)$time)
pData(cop1_eset)$time <- as.numeric(pData(cop1_eset)$time)
cop1_vistimeseq <- vistimeseqFromExpressionSet(
  cop1_eset, time = "time", group = "genotype", replicate = "individual")
cop1_vistimeseq
## An object of class "vistimeseq". 
## Project name: 'vistimeseq' time course project 
## Raw data: 36528 genes across 36 samples.

Alternatively, we can build vistimeseq object from each of the component. Here we read each of the fields of the ExpressionSet using Biobase package.

First, we extract the gene phenotype data.

# Gene data
gene_data <- fData(cop1_eset) 
gene_data[["feature"]] <- rownames(gene_data)
gene_data <- gene_data %>%
  select(feature, symbol, size, type, desc)
head(gene_data)

Then, we read and process the sample data. The information we need about each sample is (1) the experimental group it belongs to – here whether a sample corresponds to a wild type mouse or one with Cop1 knocked out, (2) the replicate id – as multiple samples where collected for the same conditions, (3) and most importantly time – the point in the course of the study at which a sample was collected.

# Sample data
smp_data <- pData(cop1_eset) %>%
  rename(sample = SAMPLE_ID) %>%
  mutate(
    tmp = factor(time, levels = c("T0", "T2.5", "T4", "T6", "T9", "T13")),
    time = as.numeric(gsub("T", "", time)),    
    replicate = paste0(genotype, "_", individual),
    label = paste0(genotype, "_", individual, "_", time),
    group = genotype
  ) %>%
  arrange(group, individual, time) %>%
  select(sample, group, individual, replicate, time, treatment, label)
rownames(smp_data) <- smp_data$sample
head(smp_data)

Finally, we also extract the expression data – the raw gene read counts.

cnts <- exprs(cop1_eset)[, smp_data$sample]        # count data
dim(cnts)
## [1] 36528    36

Building vistimeseq object

We can now combine all the data, cnts, gene_data and smp_data, into a single vistimeseq (S4) object. The data will store everything together in a way that is easier to perform further time course data analysis. The most important fields in the object are “raw.data”, “sample.data” which will contain information on group, replicate and time associated with each sample. As everything we need is stored in the “vistimeseq” object generated, we can discard all the data we generated previously.

cop1_timeseq <- vistimeseq(
  project = "[NGS1474] Time course in vitro Cop1 KO in BMDMs",
  raw.data = cnts,
  feature.data = gene_data,
  sample.data = smp_data,
  time_column = "time",
  replicate_column = "replicate",
  group_column = "group"
) 
rm(cnts, gene_data, smp_data, cop1_eset)
cop1_timeseq
## An object of class "vistimeseq". 
## Project name: [NGS1474] Time course in vitro Cop1 KO in BMDMs 
## Raw data: 36528 genes across 36 samples.

Data normalization and filtering

The raw read counts cannot be immediately used for analysis, as sequencing data involves the issue of varying sample depths. We can convert the raw counts to counts per million (CPM) using vistimeseq function, normalize_data(), which performs data normalization by column. Currently, we only support scaling sample counts by constant factors (size factors). If the argumennt column.scale.factor is not specified, by default vistimeseq divides by column sums and multiplies by 1 million to obtain CPMs. The normalized data is stored separately from the raw data in a slot “data”.

# Compute CPMs and store in cop1_timeseq@data 
cop1_timeseq <- normalize_data(cop1_timeseq) 

Since the dataset contains more than 36k genes, we will filter out the very rare ones which we assume to be too noisy and not containing enough signal for further analysis.

Here, we find and remove all genes which have the mean expression (CPM) below min_mean_cpm = 50 within either of the two groups of interest, wild type or knock-out. We set a very large threshold of 100 for this vignette only because the graphichs below would make the size of this vignette too large. When running on the your own study you should pick the threshold more carefully.

# Fine genes with minimum mean count of 5 at least in one of the two groups
min_mean_cpm <- 5
group_cpm_means <- data.frame(row.names = feature_names(cop1_timeseq))
for(g in unique(get_group(cop1_timeseq))) {
  g_cnts <- get_data(cop1_timeseq)[ , which(get_group(cop1_timeseq) == g)]
  group_cpm_means[, g] <- apply(g_cnts, 1, mean)
}
group_cpm_max <- apply(as.vector(group_cpm_means), 1, max)
genes_expressed <- feature_names(cop1_timeseq)[group_cpm_max > min_mean_cpm]
# Filter out the noisy genes
cop1_timeseq <- filter_features(cop1_timeseq, genes_expressed) 
cop1_timeseq
## An object of class "vistimeseq". 
## Project name: [NGS1474] Time course in vitro Cop1 KO in BMDMs 
## Raw data: 9356 genes across 36 samples.

Note that here, we remove the majority of features and our results might be inaccurate. However, this vignette’s purpose is only to inform how the package work and not intended as the whod dataset analysis. We do this because of Bioconductor’s constraint on file sizes.

Collapse replicates

In parts of our later analysis, we will make comparisons between genes, and therefore it is useful to aggregate gene expression accross replicates to obtain their mean behaviour. To do this we can use collapse_replicates() function for vistimeseq. The function saves collapsed sample and aggregated expression data in “sample.data.collapsed”, and “data.collapsed” respectively.

# Collapsed sample data stored in cop1_timeseq@sample.data.collapsed, 
# and mean expression values in cop1_timeseq@data.collapsed
cop1_timeseq <- collapse_replicates(cop1_timeseq, FUN = mean)

Time course format

The main focus on vistimeseq is to analyze and visualize time-series data efficiently. For this reason, we convert the expression data in a form of a rectancular matrix into a “time-course format” where each row stores a single time series corresponding to a specified combination of group membership and and replicate id (here mouse id). This data wranggling step can be performed with convert_to_timecourse() function, the “time-course” will be stored in a slot timecourse.data. This slot contains a list containing data.frames tc and tc_collapsed (if data.collapsed is defined).

Before converting data to “time-course” format, gene transformation should be performed. Transformation is thus allows for a fair gene-to-gene comparison. For example, when clustering genes, one uses pairwise distances to estimate dissimilarities between genes. Since, we are generally more interested in grouping genes based on similarities in their trajectories rather than their absolute expression levels, the read counts must be transformed before computing the distances.

Currently, gene transformation methods available in vistimeseq are “scale_feat_sum” (scaling by gene sum) or “var_stab” (variance stabilization). The user can specifiy a variance stabilization method if “var_stab” is used. VST methods supported are: “log1p” (log plus one), “asinh” (inverse hyperbolic sine) or “deseq” (DESeq2::varianceStabilizingTransformation).

Usually simply scaling by the gene sum, that is normalizing so that the total abundance the same (and equal to 1) for all genes gives good clustering of gene trajectories.

# Before conversion, scales gene expression to an even sum, for a fair 
# gene-to-gene comparison.
cop1_timeseq <- convert_to_timecourse(
  cop1_timeseq, feature.trans.method = "scale_feat_sum")

Lag differences

Time-series data have a dependency structure and are not standard multivariate vectors. Many methods have been developed for representing time-series data. A common technique is for example to fit functions e.g. polynomials or splines to the data. A similar approach is taken in functional data analysis (FDA) literature, where time series are represented as linear combinations of basis functions (e.g. principal functions). These methods seek to smooth the data and to reduce of complexity of the inherent (infinite dimensional) functional data. The fitted coefficients are often used as the time-series representatiom then used for clustering or visualization.

Unfortunately, most of the biological time course studies are short, sometimes containing as few as three or four time-points. Therefore, fitting functions to sparse time points would be inefficient. Instead, here we propose a simpler way to incorporate the dependecy structure of the time-series. Our method involves construction of additional data features, which are lag differences between consecutive time-points. Lag of order 1 for time-point \(i\), \(Lag_1(i)\), denotes the difference between the expression level at time \(i\) and time \(i-1\), lag 2 is the difference between time \(i\) and time \(i-2\), and so on. Intuitively, the lag 1 at time \(i\) approximates the slope or the first derivative of the time series curve at time-point \(i\).

We can add these extra lag features to the “time-course” data using add_lags() function, whcih appends lag features to “tc” and “tc_collapsed” data frames in the slot timecourse.data. Additionally,the user can define the weight for each lag feature by specifying the lambda argument. The length of lambda indicates how many orders of lags you would like to include, e.g. lambda = c(0.5, 0.25) means lag order 1 will be added with multiplicative weight of \(0.5\) and lag order 2 will be added with weight \(0.025\).

# Add lags to time-course data
cop1_timeseq <- add_lags(cop1_timeseq, lambda = c(0.5, 0.25))
head(time_course(cop1_timeseq))
head(time_course(cop1_timeseq, collapsed = TRUE))

At this point, we completed all data pre-processing steps available in vistimeseq. In later sections we specify how visualization, clustering and differential expression test can be performed with the package.

Data Visualization

In this section we show plotting utilities available in vistimeseq. Visualizations are data exploration tools and serve as the first step in our data analysis. In the following subsections we will describe more in details how heatmaps and PCA plots can be generated.

Heatmaps

Here we will generate a heatmap of top 100 most variable genes. The plot of the expression matrix for these most variable features will give us some insight whether there is a clear difference between the two exprerimental groups and whether a strong temporal trend can be detected.

plot_heatmap(cop1_timeseq, num.feat = 100)

In the above heatmap the colums are ordered by experimental group, replicate (mouse id) and time at which the sample was sequenced; the sample membership is indicated in the color bars on top of the columns. The main heatmap rectangle shows Z-scores of expression values represented by colors in the red-and-blue palette corresponding to high-and-low respectively. Even this first look at the data, shows us patterns present in the data – within each condition, i.e. each mouse and in each experimental group there are expression levels seem to be dependent on time.

PCA

Another way to explore the dataset is through dimensionality reduction. Here we will project the data into a space of principal componets. With PCA, you can visualize both samples and features in the same coordinates space with a biplot. Here we will keep these two maps separately, as the visualization can become overcrowded with points which obscure the inherent structure. Eventhough, we are plotting the feature and sample projections separately, they can be compared side by side to see which groups of features are more correlated with which group of samples.

Visualizing Samples

First, we project samples on a 2D map to check whether their relative location reflects time at which the sequencing was performed. If the samples are ordered in agreement with time in the PCA plot, there might be patterns in gene expression levels changing over the course of the study.

A PCA plot can also be used to examine whether samples corresponding to the same condition, here wilde type or knockouts, tend to group together, i.e. whether they are more similar to each other than to the ones in a different condition. We use prcomp() function from stats (default) package to compute PCA projection.

RNA-seq data is highly heteroscedastic, which means the features (genes) included have vastly different variances. It is know that bulk expression count data can be well modelled with a negative binomial distribution. In this distribution, variance is a quadratic function of the mean, \(\sigma^2 = \mu + \alpha\mu^2\). That is the higher the mean expression level, the higher the variance is. PCA projection maximizes the amount variance preserved in consecutive principal components, which implies that computing PCA on raw expression counts or even the RPKMs or CPMs would put too much weight on most highly abundant genes.

This step is recommended because RNA-seq data is highly heteroscedastic, which means the features (genes) included have vastly different variances. It is know that bulk expression count data can be well modelled with a negative binomial distribution. In this distribution, variance is a quadratic function of the mean, \(\sigma^2 = \mu + \alpha\mu^2\). That is the higher the mean expression level, the higher the variance is.

Thus, we recommend user to variance stabilize the data before computing a PCA projection (as described in Time course format section). The variance stabilization method can be specified in var.stabilize.method argument.

Additionally, the user might limit calculations to only specific group of samples, e.g. in this case you might be interested in visualizing samples only in the wild type group. A user can also indicate whether PCA should be applied to sample resolved data or one with replicates aggregated
(stored in “data.collapsed” slot of a vistimseq object).

cop1_timeseq <- run_pca(cop1_timeseq, var.stabilize.method = "asinh") 
plot_sample_pca(cop1_timeseq, col.var = "group", size = 5) 

plot_sample_pca(cop1_timeseq, col.var = "time", size = 5) 

In the plots above we see that the samples group mostly by time at which they have benn sequences. For some time-points, we see also a clear separation between sample from different experimental groups.

Visualizing Features

PCA also provides a projection of features to the same principal component space. The coordinates of features (here genes) are commonly referred to as PCA loadings. Since gene expression datasets usually includes thousands of genes, it is not possible to include labels for all of them in the same 2D PCA plot. Apart from that, in a time-course study one is usually interested in trajectories of genes over time, and it is good to see groups of genes with similar expression pattern clustered together on a visualization.

In order to make PCA plots for features more informative, we overlay average (over replicates) gene expression trend over time for each experimental group. Plotting trajectories for every gene would make plot overcrowded and unreadable, therefore we divide the PCA plot into \(m \times n\) grid and plot a trajectory for a gene whoch PCA coordinates are closes to the grid center point.

In the feature PCA plot below we see that genes exhibit different responses to LPS treatment. The figure shows genes organized according to their trajectory. Inhibited genes tend to gather on the left side of the plot, the primary response (early spike) genes are at the bottom, and the late response (late increase) clustered at the top. The plot doesn’t show a global difference between the wild type and knock-out group. Most genes have trajectories that are exactly overlapping in both conditions. There are, however, a few genes for which we do observe some difference between groups.

plot_ts_pca(
  cop1_timeseq, 
  linecol = c("WT" = "#e31a1c", "Loxp" = "#1f78b4"),
  main = paste("Visualizing gene profiles with PCA"), 
  col = adjustcolor("grey77", alpha=0.7))

Gene Clustering

To cluster the gene trajectories we will use the data stored in timecourse.data slot of cop1_timeseq. These are time-series of transformed expression values together with appended time lags which relsoving differences between the temporal trends.

We use hierarchical clustering define gene groupings. Either static or dynamic branch cutting (from dynamicTreeCut package) algorithms can be used to assign clusters. Since hierarchical clustering is computationally intensive (with \(O(n^3)\) complexity for standard implementations), we apply it only to a subset of genes. Specifically, we pick n_top_genes with average (over replicates) most variable expression over time in each of selected groups (here we use both wild type and knock-out) to perform clustering. Remaining genes are, then, assigned to a cluster with the closest centroid. An additional advantage of using only a subset of most variable genes for clustering is that, the core genes which exhibit negligible changes over time (which might be the majority of genes) will not much effect on clustering results.

Here we pick 1000 genes for clustering, whcih is roughly 1/3 of the number of genes after filtering.

params_for_clustering <- list(
  dynamic = TRUE, 
  dynamic_cut_params = list(deepSplit = TRUE, minModuleSize = 150))
  
cop1_timeseq <- cluster_timecourse_features(
  cop1_timeseq, n_top_feat = 3000, groups = "all",
  clust_params = params_for_clustering)

We can see the size of each of clusters computed

# See the count of genes in each cluster
cluster_map <- get_cluster_map(cop1_timeseq)
table(cluster_map$cluster)
## 
##   C1   C2   C3   C4   C5   C6 
## 2873  979 2418 2201  407  478

We can plot the hclust dendrogram obtained from hierarchical clustering performed.

# Plot the hierarchical clustering of genes2cluster
hclust_obj <- get_cluster_hclust(cop1_timeseq) 
plot(x = hclust_obj, labels = FALSE, xlab = "genes", sub = "")

Here we plot average (over replicates) gene trajectories grouped into 10 clusters found using the above described approach. The expression profiles for wild type and knock-out are plotted separately, side by side.

plot_ts_clusters(cop1_timeseq, transparency = 0.2) +
  scale_color_manual(values = c("WT" = "#1f78b4", "Loxp" = "#e31a1c"))

Timecourse plots above show genes clustered clearly into groups related to their pattern of response to LPS treatment. We see cluster C1, C2, and C4 generally inhibited. Genes in C6 spike right after LPS was applied. C3 shows moderate secondary response, and C5 exhibit late increase in expression.

Differential Expression Ranking

In the previous section we grouped the genes based on the trajectories recorded for both experimental groups. In this section we will describe how to find specific genes that exhibit different expression patterns between two experimental groups over the time-course of the study.

Differential Pointwise Expression

In some cases, a user might be interested in differential expression (DE) at specific timepoints between different experimental groups, here wild type and knock-out. We can easily test differental expression at any timepoint over the course of the study using standard DE approaches.

vistimeseq provides a wrapper timepoint_de() for differental expression testing functions (voom() + limma()) from limma package, which allows users to easily apply testing to vistimeseq objects.

Information on DE genes e.g. at timepoint 2.5 can be access as follows:

tmp_de <- get_diff_expr(cop1_timeseq, "timepoint_de")
# First 6 DE genes at timepoint = 2.5:
head(tmp_de$`2.5`) 

To find genes with the highest log-fold change in each timepoint we can call the following commands:

top10_genes <- sapply(tmp_de, function(tab) {
  top10 <- tab %>% arrange(-logFC) %>% .[["symbol"]] %>% .[1:10]
  if(length(top10) < 10) 
    top10 <- c(top10, rep(NA, 10 - length(top10)))
  return(top10)
}) %>% as.data.frame()
top10_genes

We can also find a list of genes which were found differentially expressed at any of the timepoints

de_genes <- lapply(tmp_de, function(x) x$symbol)
de_any_tmp <- unique(unlist(de_genes))
cat("Out of all", n_features(cop1_timeseq), ", there were", 
    length(de_any_tmp), "were found differentially expressed at any timepoint.")
## Out of all 9356 , there were 997 were found differentially expressed at any timepoint.

A useful diagram would show intersections of differentially expressed genes at different time-points. You should expert more intersection between consecutive time-points. You can use functions from UpSetR package to show the overlap between the DE genes across timepoints:

library(UpSetR)
upset(fromList(de_genes), 
      sets = rev(c("2.5", "4", "6", "9", "13")), keep.order = TRUE,
      number.angles = 30, #point.size = 3.5, line.size = 1,
      mainbar.y.label = "DE Gene Intersections", 
      sets.x.label = "DE Genes Per Timepoint")

Differential Trajectories

Instead of looking at each time-point separately, it is often useful to indentifying genes which exhibit different expression trajectories, i.e. ones with differential kinetics over time. We take a most natural approach which can be applied to short time-course datasets, which is an analysis of variance. In particular we use a method based on non-parametric permuntation multivariate analysis of variance (MANOVA).

To test each gene for differential trajectory under two conditions, we use the time-course format, where each row is a transformed time-series with lags corresponding to a single replicate within a particular group. adonis() function from vegan package is applied to find genes with differential trajectories. adonis apporach is based on partitioning the sums of squares of the multivariate (here time-series with lags) data to within and between-class. The significance is determined with an F-test on permitations of the data.

Using this procedure, here we will identify genes which have different trajectories within the wild type and knock-out group. This difference is determined when time series replicates (expression profiles) are more different between the groups than within the same group.

With a small number of available replicates (3 wild type and 3 knockout), a permutation based method does not yield high power. Combined with multiple hypothesis testing correction (for testing thousands of genes), we expect the method’s p-values to be mostly below significance level of \(\alpha = 0.5\). However, this approach is still useful, as we can user raw (unadjusted) p-values to filter out the genes that are with high probability not significant, and use the \(R^2\) value (the percenteage variance explained by groups) for ranking the genes in terms of the difference in expression profiles between two groups.

Function trajectory_de() can be used to find differential genes using the above described method. Results of testing procedure are saved in cop1_timeseq@diff.expr$trajectory_de.

cop1_timeseq <- trajectory_de(cop1_timeseq, dist_method = "euclidean") 
de_trajectory_res <- get_diff_expr(cop1_timeseq, "trajectory_de")
head(de_trajectory_res)
## NULL

You can filter out the genes based on the \(R^2\) value (the percenteage variance explained by groups). There are 372 genes with \(R^2 \ge 0.7\) which constitutes a fraction of 0.0397606 of all the genes in the dataset.

Here we print out 30 of the genes with highest \(R^2\) value and the pvalue equal 0.1. P-value of 0.1 is the minimum possible when performing permutation test on distances between 6 observations split evenly into 2 groups (0.1 = 2/(6 choose 3)).

# Select top most different genes across two groups
n_genes <- 30
genes_with_de_trajectory <- de_trajectory_res %>% 
  filter(pval <= max(0.05, min(pval)), R2 > 0.7) %>%
  arrange(-R2)
(genes_to_plot <- genes_with_de_trajectory$feature[1:n_genes])
##  [1] "26374"     "320782"    "104156"    "27355"     "16763"     "13039"     "17394"     "66455"     "240913"    "219144"    "15370"     "213956"    "15530"     "100038947" "14118"     "17105"    
## [17] "67465"     "18600"     "237256"    "58217"     "58182"     "64297"     "14281"     "246154"    "11520"     "15365"     "11534"     "217364"    "76267"     "12297"
plot_time_series(cop1_timeseq, features = genes_to_plot) +
  scale_color_manual(values = c("WT" =  "#1f78b4", "Loxp" = "#e31a1c")) 

Note, that in this analysis of variance approach we can specify a distance metric. Additionally, if you are interested in finding out genes DE in specific time intervals e.g. beginning or end of an experiment, you can choose to keep only the timepoints of interest are remove the rest from the time-courses stored in cop1_timeseq@timecourse.data$tc. trajectory_de() would then compute the distance for the specified time period and return DE genes at these specific time intervals.

Functional pathways

In this section we will describe procedures to find functional pathways/gene sets corresponding to genes exhibiting differential expression profiles. We use publicly available reference databases to find relevant pathways.

You can select any list of genes for this step. However, the method is intended for sets of genes found differentially expressed using methods provided by vistimeseq, discussed in the previous section.

Below, we use genes with DE trajectory, selected in the previous section. There are 372 in the set:

length(genes_with_de_trajectory$feature)
## [1] 372

Multiple functional pathways might be affected by knocking-out Cop1 gene. Therefore, we expect that genes found differential expressed can be parts of distinct pathways. Selected DE genes can be grouped by their cluster membership found earlier.

Below we plot expression profiles of selected DE genes, separated according to their cluster assignment.

plot_ts_clusters(cop1_timeseq, transparency = 0.5, 
                 features = genes_with_de_trajectory$feature) +
  scale_color_manual(values = c("WT" = "#1f78b4", "Loxp" = "#e31a1c"))

We can test sets (clusters) of DE genes individually for pathway enrichment using pathway_enrichment() which is built on top of goanna() and kegga() functions from limma package. The features argument provided to pathway_enrichment() must be Entrez Gene IDs. The user must also specify the relevant species. Optionally, the user can modify the filtering parameters applied to enrichment results: fltr_DE, fltr_N, and fltr_P.DE. For details see the documentation ?pathway_enrichment.

enrich_res <- pathway_enrichment(
  object = cop1_timeseq, 
  features = genes_with_de_trajectory$feature, 
  species = "Mm", ontology ="BP")

We can plot enrichment results for genes with differential trajectory in each cluster separately using plot_enrichment(). Here we plot the top n_max = 15 terms for each cluster. Next to enrichment terms, in the brackets, we have included information on the exact number of DE genes in a set (DE) and the size of the reference enrichment term (N). The points are colored by this fraction. The size of the points correspond the enrichment term size (N).

Below we print we print for example a plot for cluster C2 and C6

clst <- "C2"
plot_enrichment(
    enrich = enrich_res[[clst]], n_max = 15) +
  ggtitle(paste0("Cluster , ", clst, " enrichment"))

clst <- "C6"
plot_enrichment(
    enrich = enrich_res[[clst]], n_max = 15) +
  ggtitle(paste0("Cluster , ", clst, " enrichment"))

Session Information

sessionInfo()
## R version 3.5.0 (2018-04-23)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] UpSetR_1.3.3        bindrcpp_0.2.2      Biobase_2.40.0      BiocGenerics_0.26.0 vistimeseq_0.99.0   viridis_0.5.1       viridisLite_0.3.0   tidyr_0.8.0         tibble_1.4.2       
## [10] ggplot2_2.2.1       edgeR_3.22.1        limma_3.36.1        dplyr_0.7.4         rmarkdown_1.9       knitr_1.20          BiocStyle_2.7.9    
## 
## loaded via a namespace (and not attached):
##  [1] nlme_3.1-137                bitops_1.0-6                matrixStats_0.53.1          bit64_0.9-7                 RColorBrewer_1.1-2          rprojroot_1.3-2             GenomeInfoDb_1.16.0        
##  [8] dynamicTreeCut_1.63-1       tools_3.5.0                 backports_1.1.2             R6_2.2.2                    vegan_2.5-1                 rpart_4.1-13                mgcv_1.8-23                
## [15] Hmisc_4.1-1                 DBI_1.0.0                   lazyeval_0.2.1              colorspace_1.3-2            GetoptLong_0.1.6            permute_0.9-4               nnet_7.3-12                
## [22] tidyselect_0.2.4            gridExtra_2.3               DESeq2_1.20.0               bit_1.1-12                  compiler_3.5.0              htmlTable_1.11.2            DelayedArray_0.6.0         
## [29] labeling_0.3                bookdown_0.7                scales_0.5.0                checkmate_1.8.5             genefilter_1.62.0           proxy_0.4-22                stringr_1.3.0              
## [36] digest_0.6.15               foreign_0.8-70              XVector_0.20.0              base64enc_0.1-3             pkgconfig_2.0.1             htmltools_0.3.6             htmlwidgets_1.2            
## [43] rlang_0.2.0                 GlobalOptions_0.0.13        rstudioapi_0.7              RSQLite_2.1.1               shape_1.4.4                 bindr_0.1.1                 jsonlite_1.5               
## [50] BiocParallel_1.14.1         acepack_1.4.1               RCurl_1.95-4.10             magrittr_1.5                GenomeInfoDbData_1.1.0      Formula_1.2-2               Matrix_1.2-14              
## [57] Rcpp_0.12.16                munsell_0.4.3               S4Vectors_0.18.1            stringi_1.1.7               yaml_2.1.18                 MASS_7.3-49                 SummarizedExperiment_1.10.1
## [64] zlibbioc_1.26.0             plyr_1.8.4                  grid_3.5.0                  blob_1.1.1                  lattice_0.20-35             splines_3.5.0               annotate_1.58.0            
## [71] circlize_0.4.3              locfit_1.5-9.1              ComplexHeatmap_1.17.1       pillar_1.2.2                GenomicRanges_1.32.2        rjson_0.2.15                geneplotter_1.58.0         
## [78] stats4_3.5.0                XML_3.98-1.11               glue_1.2.0                  evaluate_0.10.1             latticeExtra_0.6-28         data.table_1.10.4-3         gtable_0.2.0               
## [85] purrr_0.2.4                 assertthat_0.2.0            xfun_0.1                    xtable_1.8-2                survival_2.42-3             AnnotationDbi_1.42.1        memoise_1.1.0              
## [92] IRanges_2.14.7              cluster_2.0.7-1